Abstract: Online topic modeling algorithms can not only extract topics from big data streams with constant memory requirements, but 
can also detect topic shifts as the data stream flows. Fast convergence speed is a desired property for batch learning topic models such 
as latent Dirichlet allocation (LDA), which can further facilitate developing fast online topic modeling algorithms for big data streams. 
In this paper, we present a novel and easy-to-implement fast belief propagation (FBP) algorithm to accelerate the convergence speed 
for batch learning LDA when the number of topics is large. FBP uses a dynamic scheduling scheme for asynchronous message 
passing, which passes only the most important subset of topic messages at each iteration for fast speed. From FBP, we derive an 
online belief propagation (OBP) algorithm that infers the topic distribution from the previously unseen documents incrementally by the 
online gradient descent. We show that OBP can converge to the local optimum of the LDA objective function within the online stochastic 
optimization framework. Extensive empirical studies demonstrate that OBP significantly reduces the learning time and achieves a much 
lower predictive perplexity when compared with that of several state-of-the-art online algorithms for LDA, including online variational 
Bayes (OVB) and online Gibbs sampling (OGS) algorithms. 

Index Terms — Latent Dirichlet allocation, topic models, online belief propagation, convergence speed, online Gibbs sampling, online 
variational Bayes. 
1 Introduction 

Probabilistic topic modeling [1] is an important problem
in machine learning and data mining. As one of the
simplest topic models, latent Dirichlet allocation (LDA) [2]
is usually learned by batch algorithms that must sweep
the entire data set repeatedly until convergence. These
batch algorithms can be broadly categorized into three
strategies: variational Bayes (VB) [2], collapsed Gibbs
sampling (GS) [3] and loopy belief propagation (BP) [4].

We can interpret VB, GS and BP as message passing
algorithms, which infer the posterior distribution of the
topic label for each word, called the message, and estimate
parameters by the iterative expectation-maximization (EM)
algorithm according to the maximum-likelihood criterion [5].
These EM algorithms mainly differ in the message update
equations of the E-step. For example, VB is a synchronous
variational message passing algorithm [6], which updates
the variational messages by complicated digamma
functions that slow down learning [4], [7]. In
contrast, GS updates messages by discrete topic labels
randomly sampled from the messages in the previous
iteration. The sampling operation therefore does not
keep all the uncertainty encoded in the previous messages.
In addition, such a Markov chain Monte Carlo (MCMC)
sampling process often requires more iterations until
convergence. Without sampling from messages, BP
directly uses the previous messages to update the current
messages. Such a deterministic process often takes
significantly fewer iterations than GS to converge.
According to a recent comparison [4], VB
requires around 100 iterations, GS takes around 300
iterations and synchronous BP (sBP) needs around 170
iterations to achieve convergence in terms of training
perplexity [2], which is a performance measure for
comparing different learning algorithms of LDA [7].

However, batch LDA algorithms have a high time
complexity for big data sets. For example, VB [2] requires
around 7 days to scan 8,200,000 PUBMED documents
when the number of iterations $T = 100$ and the number
of topics $K = 100$. In addition, batch LDA algorithms
usually have a high space complexity that scales linearly
with the number of documents $D$. For example, when
$D = 10^7$, VB cannot handle the big data set on a
common desktop computer with 4 GB of memory. To
process big data sets, online [8]-[15] and parallel [16]-[21]
LDA algorithms have been two widely used strategies.
Since parallel learning algorithms depend on expensive
parallel hardware and their space complexity still scales
linearly with the number of documents $D$, in this paper
we study online LDA algorithms that require only
constant memory to detect topic distribution shifts
as the big data stream flows.

Online LDA algorithms partition the entire $D$ documents
into $M$ small mini-batches of size $S$, and
use the online gradient produced by each mini-batch to
estimate topic distributions sequentially. Each mini-batch
is discarded from memory after one look. So, the
memory cost scales linearly with the mini-batch size $S$,
where $S$ is a fixed number provided by users and $S \ll D$.
Because the online gradient computation for each
mini-batch requires significantly fewer iterations
until convergence [22], online algorithms are usually
faster by a factor of 5 than batch algorithms. Current
online LDA algorithms are derived from their batch
counterparts such as GS and VB, e.g., online GS (OGS) [8]-[12]
and online VB (OVB) [13]-[15]. The convergence speed
of the online gradient descent computation determines
the efficiency of online LDA algorithms. For example,
OVB [13] does not start optimizing the next mini-batch
until the previous mini-batch achieves convergence. So,
a faster convergence speed of batch algorithms will
lead to faster online algorithms for big data streams.

In this paper, we present a novel fast belief propagation
(FBP) algorithm to accelerate the convergence speed
for batch learning LDA. Compared with sBP [4], FBP
uses an informed scheduling strategy for asynchronous
message passing, in which it efficiently influences the
slowly converging messages by passing the fast converging
messages with a higher priority. By dynamically
scheduling the order of message passing based on
the residuals of two messages resulting from successive
iterations, FBP converges significantly faster and more
often than general sBP on cluster or factor graphs [23].
Because the topic messages are usually sparse [10],
[24], FBP selects and passes only the important subset
of topic messages according to the convergence speed.
Therefore, FBP runs fast when the number of topics
is large. From FBP, we derive the online belief propagation
(OBP) algorithm to compute the online gradient descent of
the topic distribution based on the previously unseen
mini-batch incrementally. More specifically, OBP combines FBP with
the stochastic gradient descent algorithm [22], which
ensures that OBP can converge to a stationary point of
the LDA joint probability by a series of online gradient
updates. Experiments on four big data streams confirm
that OBP is not only faster but also more accurate than
several state-of-the-art online LDA algorithms such as
OGS [10] and OVB [13].

This paper is organized as follows. Section 2 presents
the FBP algorithm, which has a fast convergence speed
for learning LDA. Section 3 derives OBP from FBP, and
demonstrates that OBP is a stochastic gradient descent
algorithm converging to a stationary point. Section 4
compares FBP and OBP with several state-of-the-art
batch and online LDA algorithms on four real-world
text corpora. Finally, Section 5 draws conclusions and
envisions future work.

2 Fast Belief Propagation 

In this section, we begin by briefly reviewing the BP
algorithm for learning the collapsed LDA [4]. The
probabilistic topic modeling task can be interpreted as a
labeling problem, in which the objective is to assign a set
of thematic topic labels, $z_{W \times D} = \{z^k_{w,d}\}$, to explain the
observed elements in the document-word matrix, $x_{W \times D} = \{x_{w,d}\}$.
The notations $1 \le w \le W$ and $1 \le d \le D$ are
the word index in the vocabulary and the document index
in the corpus. The notation $1 \le k \le K$ is the topic index.
The nonzero element $x_{w,d} \neq 0$ denotes the number of
word counts at the index $\{w, d\}$. For each word token,
there is a topic label $z^k_{w,d,i} = \{0, 1\}$, $\sum_{k=1}^{K} z^k_{w,d,i} = 1$,
$1 \le i \le x_{w,d}$, so that the topic label for the word index
$\{w, d\}$ is $z^k_{w,d} = \sum_{i=1}^{x_{w,d}} z^k_{w,d,i} / x_{w,d}$. After integrating out
the document-specific topic proportions $\theta_d(k)$ and the topic
distribution over vocabulary words $\phi_w(k)$ in LDA, we
obtain the joint probability of the collapsed LDA
where $\Gamma(\cdot)$ is the gamma function, and $\{\alpha, \beta\}$ are fixed
symmetric Dirichlet hyperparameters [3].

The sBP algorithm [4] computes the posterior probability
of each topic label, called the message, $0 \le \mu_{w,d}(k) \le 1$,
which can be normalized using a local computation, i.e.,
$\sum_{k=1}^{K} \mu_{w,d}(k) = 1$. The message update equation is

$$\mu_{w,d}(k) \propto [\theta_{-w,d}(k) + \alpha] \times \frac{\phi_{w,-d}(k) + \beta}{\sum_{w} \phi_{w,-d}(k) + W\beta}, \quad (2)$$

where

$$\theta_{-w,d}(k) = \sum_{-w} x_{-w,d}\, \mu_{-w,d}(k), \quad (3)$$

$$\phi_{w,-d}(k) = \sum_{-d} x_{w,-d}\, \mu_{w,-d}(k). \quad (4)$$

Here, $-w$ and $-d$ denote all word indices except $w$
and all document indices except $d$. The message
update equation (2) therefore depends on all other messages
$\mu_{-w,-d}$ excluding the current message $\mu_{w,d}$. After the
normalized messages converge, the document-specific
topic proportion $\theta$ and the topic distribution over the
fixed vocabulary $\phi$ can be estimated as
The synchronous schedule $f^s$ updates all messages (2)
in parallel at iteration $t$ based on the
messages at the previous iteration $t - 1$:

$$f^s(\mu^{t-1}_{1,1}, \ldots, \mu^{t-1}_{W,D}) = \{f(\mu^{t-1}_{-1,-1}), \ldots, f(\mu^{t-1}_{-w,-d}), \ldots, f(\mu^{t-1}_{-W,-D})\}, \quad (7)$$

where $f$ is the message update function (2) and $\mu_{-w,-d}$
is the set of all messages excluding $\mu_{w,d}$. The asynchronous
schedule $f^a$ updates the message of each variable in
a certain order, which is in turn used to update other
neighboring messages immediately at each iteration $t$:

$$f^a(\mu_{1,1}, \ldots, \mu_{W,D}) = \{\mu_{1,1}, \ldots, f(\mu_{-w,-d}), \ldots, \mu_{W,D}\}, \quad (8)$$

where the message update equation $f$ is applied to each
message one at a time in some order.
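
As an illustration of the synchronous schedule, the following minimal sketch performs one sweep of the message update (2) on a small dense count matrix. It is not the authors' implementation: the function name `sbp_sweep`, the dense `(W, D, K)` message layout, and the literal reading of the denominator (all topic mass outside document $d$) are assumptions made for this example.

```python
import numpy as np

def sbp_sweep(x, mu, alpha=0.01, beta=0.01):
    """One synchronous sweep of the message update (2) (illustrative sketch).

    x  : (W, D) array of word counts x_{w,d}
    mu : (W, D, K) array of normalized messages mu_{w,d}(k)
    Returns new normalized messages with the same shape as mu.
    """
    W = x.shape[0]
    xm = x[:, :, None] * mu                  # x_{w,d} * mu_{w,d}(k)
    theta = xm.sum(axis=0)                   # theta_d(k), shape (D, K)
    phi = xm.sum(axis=1)                     # phi_w(k),   shape (W, K)
    theta_excl = theta[None, :, :] - xm      # theta_{-w,d}(k)
    phi_excl = phi[:, None, :] - xm          # phi_{w,-d}(k)
    # sum_w phi_{w,-d}(k): mass on topic k outside document d (assumed reading)
    phi_sum_excl = phi.sum(axis=0)[None, None, :] - theta[None, :, :]
    new_mu = (theta_excl + alpha) * (phi_excl + beta) / (phi_sum_excl + W * beta)
    # Local normalization over the K topics of each message
    return new_mu / new_mu.sum(axis=2, keepdims=True)

# Toy usage: W = 6 words, D = 4 documents, K = 3 topics
rng = np.random.default_rng(0)
x_demo = rng.integers(0, 3, size=(6, 4)).astype(float)
mu_demo = rng.random((6, 4, 3))
mu_demo /= mu_demo.sum(axis=2, keepdims=True)
mu_next = sbp_sweep(x_demo, mu_demo)
```

All new messages are computed from the previous ones before any of them is written back, which is what distinguishes this synchronous sweep from the asynchronous schedule (8).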
2.1 Speed Up Convergence 

The basic idea of FBP for LDA is to select the update
order of the asynchronous schedule (8) by sorting the
messages' residuals $r_{w,d}$ in descending order. The residuals
are defined as the $p$-norm of the difference between
two message vectors at successive iterations [23],

$$r_{w,d}(k) = x_{w,d} \left\| \mu^{t}_{w,d}(k) - \mu^{t-1}_{w,d}(k) \right\|_p, \quad (9)$$

where $x_{w,d}$ is the number of word counts. For simplicity,
we choose the $L_1$ norm with $p = 1$. In practice, the
computational cost of sorting (9) is very high because we
need to sort all non-zero residuals $r_{w,d}$ in the
document-word matrix at each learning iteration. This
scheduling cost is prohibitive for big data sets.
Alternatively, we may accumulate residuals based on the
document index,

$$r_d(k) = \sum_{w} r_{w,d}(k), \quad (10)$$

$$r_d = \sum_{k} r_d(k). \quad (11)$$

These residuals can be computed during the message
passing process at a negligible computational cost. Also,
the time complexities of sorting (10) and (11) in
descending order are at most $O(K \log K)$ and $O(D \log D)$,
respectively. As a result, the sorting time is short during
each learning iteration.
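
The residual bookkeeping in (9)-(11) can be sketched as follows, assuming the $L_1$ norm and a dense `(W, D, K)` message array. The function name `residuals` and the array layout are illustrative assumptions, not part of the original algorithm description.

```python
import numpy as np

def residuals(x, mu_new, mu_old):
    """Compute the residuals (9)-(11) and the descending orders (sketch).

    x              : (W, D) word counts
    mu_new, mu_old : (W, D, K) messages at iterations t and t-1
    """
    r_wdk = x[:, :, None] * np.abs(mu_new - mu_old)  # (9) with p = 1
    r_dk = r_wdk.sum(axis=0)                          # (10), shape (D, K)
    r_d = r_dk.sum(axis=1)                            # (11), shape (D,)
    topic_order = np.argsort(-r_dk, axis=1)           # per-document topic order
    doc_order = np.argsort(-r_d)                      # document order
    return r_dk, r_d, topic_order, doc_order

# Toy usage: W = 2, D = 2, K = 2
x_demo = np.array([[1.0, 2.0], [0.0, 1.0]])
mu_old = np.full((2, 2, 2), 0.5)
mu_new = np.array([[[0.9, 0.1], [0.5, 0.5]],
                   [[0.5, 0.5], [0.2, 0.8]]])
r_dk, r_d, topic_order, doc_order = residuals(x_demo, mu_new, mu_old)
```

Sorting each row of `r_dk` costs $O(K \log K)$ and sorting `r_d` costs $O(D \log D)$, in line with the complexities stated above.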

Here, we demonstrate that the FBP algorithm has a
faster convergence rate than sBP [4]. We assume that
the message update equation $f$ in (2) is a contraction
function under some norm [23], so that

$$\|\mu^{t} - \mu^{*}\| \le \gamma \|\mu^{t-1} - \mu^{*}\|, \quad (12)$$

for some global contraction factor $0 \le \gamma < 1$. Eq. (12)
guarantees that the messages $\mu^{t} = \{\mu^{t}_{1,1}, \ldots, \mu^{t}_{W,D}\}$ will
converge to a fixed point $\mu^{*} = \{\mu^{*}_{1,1}, \ldots, \mu^{*}_{W,D}\}$ in the
synchronous schedule (7). This assumption often holds
true for sBP in learning the collapsed LDA [4] based on
the cluster or factor graphical representation. According
to [23], the asynchronous schedule (8) will also converge
to a fixed point $\mu^{*}$ if $f$ is a contraction mapping and, for
each message $\mu_{w,d}(k)$, there is a finite time interval such
that the message update equation $f$ is executed at least
once in this time interval. As a result, the FBP algorithm
will converge to a fixed point $\mu^{*}$, as sBP does.

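To speed up convergence in the asynchronous schedule (8),
we choose to update first the message $\mu_{w,d}$ with
the largest distance $\|\mu^{t}_{w,d} - \mu^{*}_{w,d}\|$. However, we
cannot directly measure the distance between a current
message and its unknown fixed-point value. Alternatively,
we can derive a bound on this distance that can be
calculated easily. Using the triangle inequality, we obtain

$$\|\mu^{t}_{w,d} - \mu^{t-1}_{w,d}\| = \|\mu^{t}_{w,d} - \mu^{*}_{w,d} + \mu^{*}_{w,d} - \mu^{t-1}_{w,d}\|$$
$$\le \|\mu^{t}_{w,d} - \mu^{*}_{w,d}\| + \|\mu^{t-1}_{w,d} - \mu^{*}_{w,d}\|$$
$$\le \gamma \|\mu^{t-1}_{w,d} - \mu^{*}_{w,d}\| + \|\mu^{t-1}_{w,d} - \mu^{*}_{w,d}\|$$
$$= (1 + \gamma) \|\mu^{t-1}_{w,d} - \mu^{*}_{w,d}\|. \quad (13)$$

According to (12) and (13), we derive the bound (14) as
follows

$$\|\mu^{t}_{w,d} - \mu^{*}_{w,d}\| \le \gamma \|\mu^{t-1}_{w,d} - \mu^{*}_{w,d}\|$$
$$= \|\mu^{t-1}_{w,d} - \mu^{*}_{w,d}\| - (1 - \gamma) \|\mu^{t-1}_{w,d} - \mu^{*}_{w,d}\|$$
$$\le \|\mu^{t-1}_{w,d} - \mu^{*}_{w,d}\| - \frac{1 - \gamma}{1 + \gamma} \|\mu^{t}_{w,d} - \mu^{t-1}_{w,d}\|, \quad (14)$$

so the distance to the fixed point decreases by at least
some fraction (less than 1) of the difference between the
message before and after the update, $\|\mu^{t}_{w,d} - \mu^{t-1}_{w,d}\|$.
Because we do not know the fixed point $\mu^{*}_{w,d}$ in (14),
we can instead maximize the corresponding difference
$\|\mu^{t}_{w,d} - \mu^{t-1}_{w,d}\|$ in order to
minimize $\|\mu^{t}_{w,d} - \mu^{*}_{w,d}\|$. Notice that this difference is the
definition of the message residual (9). Therefore, if we
always update and pass messages in the descending
order of residuals (9) dynamically, the FBP algorithm
will converge faster to the fixed point $\mu^{*}$ than sBP [4]
for learning LDA.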

2.2 Sparse Message Passing 

The message (2) takes $K$ iterations for both update and
local normalization. When the number of topics $K$ is
large, for example, $K > 100$, the total number of $2K$
iterations is computationally expensive for each message.
Fortunately, the message $\mu_{w,d}(k)$ is very sparse [10],
[24] when $K$ is large. In this case, we do not need
to update and normalize all $K$-tuple messages while
retaining almost the same topic modeling accuracy. From
the residuals $r_d(k)$ in (10), FBP selects only the subset of
topics $\eta\mathcal{K}_d$ with top residuals $r_d(k)$ for message updating
and passing at each learning iteration, where $\eta \in (0, 0.5]$
is the ratio parameter provided by the user. Therefore,
FBP consumes only $2\eta K$ iterations for message update
and normalization, where $2\eta K \ll 2K$. The smaller
$\eta$ is, the faster FBP runs in practice.

Intuitively, the residual reflects the convergence speed
of each message. For example, $r_d(k) > r_d(k')$ implies that
the message on the topic $k$ converges faster than that on
the topic $k'$ at the document $d$ in the corpus. Similarly,
$r_d > r_{d'}$ implies that the messages on the document
$d$ converge faster than those on the document $d'$. At
each learning iteration, FBP updates and passes only
the subset of fast converging messages $\eta\mathcal{K}_d$, and keeps
the subset of slowly converging messages unchanged.
Because messages in FBP converge in most cases, those
top residuals will gradually become smaller after several
iterations, and thus can be ranked lower by sorting $r_d(k)$
in descending order. As a result, those previously
smaller residuals will be ranked higher for message
updating and passing in later iterations. For example,
in later iterations $r_d(k') > r_d(k)$, and the message on
the topic $k'$ will be updated and passed. In this sense,
FBP keeps all uncertainties and retains almost the same
accuracy as the conventional BP algorithm [4].
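
The selection step can be sketched for a single message as follows. The names `sparse_update` and `mu_hat` are hypothetical, and the local normalization is read here as rescaling the updated subset so that it keeps the probability mass it held before the update; that reading is an assumption of this sketch rather than a statement of the authors' exact formula.

```python
import numpy as np

def sparse_update(mu_old, mu_hat, r_dk_row, eta=0.1):
    """Update only the top ceil(eta * K) topics of one message (sketch).

    mu_old   : (K,) normalized message from the previous iteration
    mu_hat   : (K,) unnormalized updated message; entries are positive
               when alpha, beta > 0 in (2)
    r_dk_row : (K,) residuals r_d(k) of the document that owns the message
    """
    K = mu_old.shape[0]
    n_sel = max(1, int(np.ceil(eta * K)))
    subset = np.argsort(-r_dk_row)[:n_sel]   # topics with the largest residuals
    mu_new = mu_old.copy()                    # other topics stay unchanged
    mass = mu_old[subset].sum()               # mass previously held by the subset
    # Assumed local normalization: the subset keeps its previous total mass,
    # so the whole message still sums to one without a global pass over K.
    mu_new[subset] = mu_hat[subset] / mu_hat[subset].sum() * mass
    return mu_new, subset

# Toy usage with K = 4 topics and eta = 0.5 (two topics updated)
mu_old = np.array([0.1, 0.2, 0.3, 0.4])
mu_hat = np.array([5.0, 1.0, 1.0, 3.0])
r_row = np.array([0.9, 0.1, 0.2, 0.8])
mu_new, subset = sparse_update(mu_old, mu_hat, r_row, eta=0.5)
```

Only `n_sel` entries are touched per message, which is where the reduction from $2K$ to $2\eta K$ operations comes from.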

Fig. 1 summarizes the FBP algorithm, where $T$ is the
total number of learning iterations. At the first iteration
$t = 1$, FBP is the same as the conventional BP that
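input : $x_{W \times D}, K, T, \alpha, \beta, \eta$.
output : $\theta, \phi$.

1 $\mu^{0}_{w,d}(k) \leftarrow$ random initialization and normalization;
2 $\phi_w(k) \leftarrow \sum_d x_{w,d}\, \mu^{0}_{w,d}(k)$;
3 $\theta_d(k) \leftarrow \sum_w x_{w,d}\, \mu^{0}_{w,d}(k)$;
4 for $d \leftarrow 1$ to $D$ do
5     for $k \leftarrow 1$ to $K$ do
6         $\phi_{w,-d}(k) \leftarrow \phi_w(k) - x_{w,d}\, \mu^{0}_{w,d}(k)$;
7         $\theta_{-w,d}(k) \leftarrow \theta_d(k) - x_{w,d}\, \mu^{0}_{w,d}(k)$;
8         $\hat{\mu}^{1}_{w,d}(k) \leftarrow [\theta_{-w,d}(k) + \alpha] \times \frac{\phi_{w,-d}(k) + \beta}{\sum_w \phi_{w,-d}(k) + W\beta}$;
9         $\mu^{1}_{w,d}(k) \leftarrow$ normalization of $\hat{\mu}^{1}_{w,d}(k)$ over all $K$ topics;
10        $r^{1}_d(k) \leftarrow \sum_w x_{w,d}\, |\mu^{1}_{w,d}(k) - \mu^{0}_{w,d}(k)|$;
11        $\phi_w(k) \leftarrow \phi_{w,-d}(k) + x_{w,d}\, \mu^{1}_{w,d}(k)$;
12        $\theta_d(k) \leftarrow \theta_{-w,d}(k) + x_{w,d}\, \mu^{1}_{w,d}(k)$;
13    end
14    $\mathcal{K}^{1}_d \leftarrow$ sort($r^{1}_d(k)$, 'descend');
15    $r^{1}_d \leftarrow \sum_k r^{1}_d(k)$;
16 end
17 $\mathcal{D}^{1} \leftarrow$ sort($r^{1}_d$, 'descend');
18 for $t \leftarrow 2$ to $T$ do
19    for $d \in \mathcal{D}^{t-1}$ do
20        for $k \in \eta\mathcal{K}^{t-1}_d$ do
21            $\phi_{w,-d}(k) \leftarrow \phi_w(k) - x_{w,d}\, \mu^{t-1}_{w,d}(k)$;
22            $\theta_{-w,d}(k) \leftarrow \theta_d(k) - x_{w,d}\, \mu^{t-1}_{w,d}(k)$;
23            $\hat{\mu}^{t}_{w,d}(k) \leftarrow [\theta_{-w,d}(k) + \alpha] \times \frac{\phi_{w,-d}(k) + \beta}{\sum_w \phi_{w,-d}(k) + W\beta}$;
24            $\mu^{t}_{w,d}(k) \leftarrow$ local normalization of $\hat{\mu}^{t}_{w,d}(k)$ over $k \in \eta\mathcal{K}^{t-1}_d$;
25            $r^{t}_d(k) \leftarrow \sum_w x_{w,d}\, |\mu^{t}_{w,d}(k) - \mu^{t-1}_{w,d}(k)|$;
26            $\phi_w(k) \leftarrow \phi_{w,-d}(k) + x_{w,d}\, \mu^{t}_{w,d}(k)$;
27            $\theta_d(k) \leftarrow \theta_{-w,d}(k) + x_{w,d}\, \mu^{t}_{w,d}(k)$;
28        end
29        $\mathcal{K}^{t}_d \leftarrow$ sort($r^{t}_d(k)$, 'descend');
30        $r^{t}_d \leftarrow \sum_k r^{t}_d(k)$;
31    end
32    $\mathcal{D}^{t} \leftarrow$ sort($r^{t}_d$, 'descend');
33    if $|\mathcal{P}^{t} - \mathcal{P}^{t-1}| < 1$ then break;
34 end

Fig. 1. The FBP algorithm for LDA.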

updates and normalizes all messages for all topics (lines
1-17). The only difference is the computation of residuals
$r_d(k)$ and $r_d$, and the sorting of residuals to get the
descending orders $\mathcal{K}_d$ and $\mathcal{D}$, respectively. For $2 \le t \le T$, based on
the descending orders $\mathcal{K}_d$ and $\mathcal{D}$, FBP selects the subset of
$\eta\mathcal{K}_d$ topics for message updating and passing (lines 18-32).
The message normalization for the subset of topics
$k \in \eta\mathcal{K}_d$ is

$$\mu^{t}_{w,d}(k) = \frac{\hat{\mu}^{t}_{w,d}(k)}{\sum_{k \in \eta\mathcal{K}_d} \hat{\mu}^{t}_{w,d}(k)} \times \sum_{k \in \eta\mathcal{K}_d} \mu^{t-1}_{w,d}(k), \quad (15)$$

where $\mu^{t}_{w,d}(k)$ and $\mu^{t-1}_{w,d}(k)$ are the normalized messages in
the current and previous iterations, and $\hat{\mu}^{t}_{w,d}(k)$ is the
unnormalized message according to (2). In this way, we
need only $\eta K$ iterations for local message normalization
and avoid calculating the global normalization factor
$Z = \sum_k \hat{\mu}^{t}_{w,d}(k)$ with $K$ iterations. Notice that we
dynamically find the best orderings $\mathcal{K}^{t}_d$ and $\mathcal{D}^{t}$ to locate
the fast converging messages after each iteration (lines
29 and 32). Finally, at the end of each iteration, we check
whether FBP has converged in order to break the loop (line 33). In
this paper, we terminate FBP if the difference of the training
perplexity [4],

$$\mathcal{P} = \exp\left\{ -\frac{\sum_{w,d} x_{w,d} \log\left[\sum_k \theta_d(k)\, \phi_w(k)\right]}{\sum_{w,d} x_{w,d}} \right\}, \quad (16)$$

between two successive iterations is less than one, because the
training perplexity decreases very little in the later
iterations.
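
The stopping rule can be sketched as below, assuming `theta` holds normalized document-topic proportions (rows sum to one over topics) and `phi` holds normalized topic-word distributions (each topic column sums to one over words). The function names are illustrative and not taken from the original implementation.

```python
import numpy as np

def training_perplexity(x, theta, phi):
    """Training perplexity (16) on a dense count matrix (sketch).

    x     : (W, D) word counts
    theta : (D, K) normalized topic proportions theta_d(k)
    phi   : (W, K) normalized topic distributions phi_w(k)
    """
    p_wd = phi @ theta.T                     # sum_k theta_d(k) phi_w(k), (W, D)
    mask = x > 0                             # zero counts contribute nothing
    loglik = (x[mask] * np.log(p_wd[mask])).sum()
    return float(np.exp(-loglik / x.sum()))

def has_converged(prev_perplexity, curr_perplexity, tol=1.0):
    """Stop when two successive perplexities differ by less than tol."""
    return abs(prev_perplexity - curr_perplexity) < tol

# Toy check: if every topic is uniform over W = 4 words, each word has
# probability 1/4 and the perplexity equals the vocabulary size.
x_demo = np.array([[1.0, 0.0, 2.0],
                   [0.0, 3.0, 1.0],
                   [2.0, 2.0, 0.0],
                   [1.0, 1.0, 1.0]])
phi_demo = np.full((4, 2), 0.25)
theta_demo = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]])
perp = training_perplexity(x_demo, theta_demo, phi_demo)
```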

The time complexity of FBP is $O(\eta KDT)$, where $K$ is
the number of topics, $D$ the number of documents, and
$T$ the number of iterations to convergence. Compared
with other batch LDA algorithms such as VB [2], GS [3]
and BP [4], the time complexity of FBP is scaled linearly
by $\eta \in (0, 0.5]$; therefore FBP is much faster than VB,
GS and BP, especially when the number of topics $K$ is
large. Compared with other sparse strategies such as fast
Gibbs sampling (FGS) [24] and sparse Gibbs sampling
(SGS) [10], FBP can control the balance between speed and
accuracy by the ratio parameter $\eta$. Intuitively, a smaller
$\eta$ leads to a faster speed but a relatively lower
accuracy. However, our experiments have confirmed that
$\eta = 0.1$ is enough to achieve almost the same topic
modeling accuracy as sBP [4] when $K > 100$.

3 Online Belief Propagation 

Since FBP is a batch LDA algorithm, it cannot process
big data streams owing to high memory costs and slow
computation speed. In this section, we combine FBP with
the online stochastic optimization method [22]. The resulting
algorithm, referred to as OBP, can converge to a stationary
point of the LDA joint probability (1).

OBP decomposes the document-word matrix into a
sequence of $M = \lfloor D/S \rfloor$ mini-batches, i.e.,
$x_{W \times D} = \{x^{1}_{W \times S}, \ldots, x^{m}_{W \times S}, \ldots, x^{M}_{W \times S}\}$,
where $\lfloor \cdot \rfloor$ is the floor operation.
In each mini-batch, there are $1 \le s \le S$
documents. When the data stream flows to the
unseen mini-batch $x^{m}_{W \times S}$, all current online LDA
algorithms [8]-[15] aim to maximize the joint probability
$p(x^{m}_{W \times S}, z^{m}_{W \times S} | \phi^{m-1}_w(k), \alpha, \beta)$ of unseen documents
conditioned on the previous topic distribution $\phi^{m-1}_w(k)$. To
maximize this conditional joint probability, OBP first
randomly initializes and normalizes the messages $\mu^{m}_{w,s}$ for
the unseen documents $x^{m}_{w,s}$, and then updates the topic
distribution $\phi^{m}_w(k)$,

$$\phi^{m}_w(k) = \phi^{m-1}_w(k) + \sum_{s} x^{m}_{w,s}\, \mu^{m}_{w,s}(k), \quad (17)$$

where $\phi^{m-1}_w(k)$ is the topic distribution of the previous
mini-batch. Similarly, the document-specific topic
proportion for unseen documents is initialized as

$$\theta^{m}_s(k) = \sum_{w} x^{m}_{w,s}\, \mu^{m}_{w,s}(k). \quad (18)$$
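input : $x_{W \times D}, K, S, \alpha, \beta, \eta$.
output : $\phi_w(k)$.

1 $M \leftarrow \lfloor D/S \rfloor$;
2 $\phi^{1}_w(k) \leftarrow$ FBP($x^{1}_{w,s}, \mu^{1}_{w,s}, K, \alpha, \beta, \eta$);
3 for $m \leftarrow 2$ to $M$ do
4     $\mu^{m}_{w,s}(k) \leftarrow$ random initialization and normalization;
5     $\phi^{m}_w(k) \leftarrow \phi^{m-1}_w(k) + \sum_s x^{m}_{w,s}\, \mu^{m}_{w,s}(k)$;
6     $\theta^{m}_s(k) \leftarrow \sum_w x^{m}_{w,s}\, \mu^{m}_{w,s}(k)$;
7     $\phi^{m}_w(k) \leftarrow$ FBP($x^{m}_{w,s}, \mu^{m}_{w,s}, \theta^{m}_s(k), \phi^{m}_w(k), K, \alpha, \beta, \eta$);
8 end

Fig. 2. The OBP algorithm for LDA.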
Fig. 3. (A) FBP and (B) OBP algorithms for learning LDA. 



Using $\{\phi^{m}, \theta^{m}, \mu^{m}\}$ as initial values, OBP runs FBP as
shown in Fig. 1 until convergence. Notice that in the OBP
setting, FBP fixes $\phi^{m-1}$ and updates only the messages
$\mu^{m}$ for the current mini-batch $x^{m}$ in (17). Because the
previous topic distribution $\phi^{m-1}_w(k)$ provides the gradient
descent for the message update in (2), FBP uses
significantly fewer iterations $T_m$ until convergence.
Since the batch FBP maximizes the joint probability
$p(x, z | \alpha, \beta)$ of LDA, OBP can maximize the conditional
joint probability $p(x^{m}, z^{m} | \phi^{m-1}, \alpha, \beta)$ for each unseen
mini-batch. Moreover, OBP can detect topic shifts
using (17), where $\phi^{m-1}_w(k)$ is the topic distribution of the
$(m-1)$th mini-batch, and $\sum_s x^{m}_{w,s}\, \mu^{m}_{w,s}(k)$ is the topic shift
contributed by the $m$th mini-batch.

Fig. 2 summarizes the OBP algorithm for learning
LDA. For the first mini-batch, we run FBP to obtain the
initial topic distribution $\phi^{1}_w(k)$ (line 2). For the $2 \le m \le M$
mini-batches, OBP uses FBP to re-estimate the $m$th topic
distribution $\phi^{m}_w(k)$ to convergence (line 7) after random
initialization of messages and parameters for the
previously unseen mini-batch $x^{m}$ (lines 4-6). OBP in Fig. 2
reduces to FBP (Fig. 1) if $S = D$. Fig. 3 illustrates the
difference between FBP and OBP. In Fig. 3A, FBP scans
the entire document-word matrix $x_{W \times D}$ for a total of $T$
iterations until convergence. In contrast, OBP (Fig. 3B)
scans each mini-batch sub-matrix $x^{m}_{W \times S}$ sequentially
for a total of $T_m$ iterations until convergence. Because
OBP fixes the previous topic distribution $\phi^{m-1}$, and only
updates the messages $\mu^{m}$ for the unseen documents $x^{m}_{w,s}$, it
uses significantly fewer iterations until convergence,
i.e., $T_m \ll T$. As a result, to extract topics from the
entire document-word matrix, OBP is much faster than
FBP. The time complexity of OBP is $O(\eta KDT_m)$, which
is much smaller than FBP's $O(\eta KDT)$ because $T_m \ll T$.
The space complexity of OBP is $O(SK)$, which is also
much smaller than FBP's $O(DK)$ because $S \ll D$.
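
The constant-memory streaming structure of OBP can be sketched as follows. This is a simplified skeleton under stated assumptions: `infer_messages` is a hypothetical stand-in for the inner FBP run on one mini-batch, the accumulated statistic is kept unnormalized as in (17), and the first mini-batch is treated like the others instead of being handled separately as in Fig. 2.

```python
import numpy as np

def stream_minibatches(x, S):
    """Yield the M = floor(D / S) mini-batches of a (W, D) count matrix."""
    M = x.shape[1] // S
    for m in range(M):
        yield x[:, m * S:(m + 1) * S]

def obp_outer_loop(x, K, S, infer_messages):
    """Accumulate phi_w(k) over mini-batches as in (17) (sketch).

    infer_messages(x_m, phi_prev) must return (W, S, K) normalized
    messages for the mini-batch x_m given the previous phi; it is a
    hypothetical placeholder for running FBP to convergence.
    """
    W = x.shape[0]
    phi = np.zeros((W, K))
    for x_m in stream_minibatches(x, S):
        mu_m = infer_messages(x_m, phi)                    # messages of this batch only
        phi = phi + (x_m[:, :, None] * mu_m).sum(axis=1)   # update (17)
        # x_m and mu_m are not kept: per-batch memory is O(S * K) per word
    return phi

def uniform_messages(x_m, phi_prev):
    """Placeholder inference that returns uniform messages (demo only)."""
    K = phi_prev.shape[1]
    return np.full(x_m.shape + (K,), 1.0 / K)

# Toy usage: W = 3 words, D = 4 documents, mini-batch size S = 2
x_demo = np.arange(12, dtype=float).reshape(3, 4)
phi_demo = obp_outer_loop(x_demo, K=2, S=2, infer_messages=uniform_messages)
```

Only `phi` persists across mini-batches, which is why the memory cost depends on $S$ rather than on $D$.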

3.1 Analysis of Convergence 

In this subsection, we show that OBP in Fig. 2 converges
to a stationary point of the joint probability $p(x, z | \alpha, \beta)$
of LDA. First, FBP in Fig. 1 can be viewed as a batch
gradient descent algorithm that computes the average of
gradients or messages (2) to estimate the topic
distribution (6). Each iteration of FBP involves the burden of
computing the average gradient over the entire
document-word matrix $x_{W \times D}$. To relieve this burden, OBP
computes the online gradient descent over the small
mini-batch $x^{m}_{w,s}$ to estimate the topic distribution (17).
More specifically, we can rewrite (17) as

$$\phi^{m}_w(k) = \phi^{m-1}_w(k) + \frac{1}{m-1} \Delta\phi^{m}_w(k), \quad (19)$$

where the notation $\Delta\phi^{m}_w(k)$ denotes the online gradient
descent $\sum_s x^{m}_{w,s}\, \mu^{m}_{w,s}(k)$ that is updated by FBP in Fig. 2.
Eq. (19) has a learning rate $1/(m-1)$ because $\phi^{m-1}_w(k)$
accumulates the messages of the previous $m-1$ mini-batches,
and $\Delta\phi^{m}_w(k)$ accumulates only the messages of the current
mini-batch. Since this learning rate satisfies the following
two conditions,



$$\sum_{m=2}^{\infty} \frac{1}{m-1} = \infty, \quad (20)$$

$$\sum_{m=2}^{\infty} \frac{1}{(m-1)^2} < \infty, \quad (21)$$



the online stochastic learning theory [22] shows that
$\phi^{m}_w(k)$ will converge to a stationary point of the LDA
objective function (1), and the gradient $\Delta\phi^{m}_w(k)$ will
converge to zero as $m \to \infty$.

Fig. 4 shows the hypergraph [26] representation for
the online LDA model. For each mini-batch $x^{m}_{w,s}$, there
is a collapsed LDA model represented by three
hyperedges $\{\theta^{m}, \phi^{m}, \gamma^{m}\}$ denoted by yellow, green and
red rectangles, which correspond to the three terms
of the joint probability (1), respectively. For example,
the hyperedge $\phi^{m}$ describes the dependencies between
the topic label $z^{m,k}_{w,s}$ and its neighboring topic labels
$z^{m,k}_{w,-s}$, and it corresponds to the second term of the joint
probability (1). The notation $-s$ means all document
indices in the $m$th mini-batch excluding $s$. The online
LDA model uses an additional hyperedge (the blue rectangle)
to describe the dependencies between the successive
hyperedges $\phi^{m-1}$ and $\phi^{m}$, corresponding to the online
gradient descent (19). For the $m$th mini-batch, we use
FBP to infer the topic message of the variable $z^{m,k}_{w,s}$
using (2). The dependency (19) between two successive
mini-batches $m-1$ and $m$ is denoted by the blue
hyperedge as shown in Fig. 4. Since the $m$th mini-batch
depends only on the $(m-1)$th mini-batch, the online LDA
model follows the first-order Markov assumption that
has been widely used to model time series [27].

Fig. 4. The hypergraph representation for the online LDA model.

3.2 Relationship to Previous Algorithms 

Online LDA algorithms infer topics from unseen
documents in the data stream. However, inference for unseen
documents has already been discussed for batch LDA
algorithms [4]. The predictive perplexity $\mathcal{P}$ on an
unseen test set is a widely used performance measure to
evaluate different batch LDA algorithms. To calculate the
predictive perplexity, we fix the topic distribution $\phi_w(k)$
estimated from the training set, and run batch LDA
algorithms to estimate $\theta_s(k)$ for 80% of the unseen test documents.
The predictive perplexity on unseen documents is

$$\mathcal{P} = \exp\left\{ -\frac{\sum_{w,s} x^{20\%}_{w,s} \log\left[\sum_k \theta_s(k)\, \phi_w(k)\right]}{\sum_{w,s} x^{20\%}_{w,s}} \right\}, \quad (22)$$

where $x^{20\%}_{w,s}$ is the word counts for the remaining 20% of the
unseen test documents.

TABLE 1
Statistics of four document data sets.

Similar to OBP, OVB [13] also integrates VB [2] with
the online stochastic learning framework [22]. It uses the
following topic distribution update equation,

$$\phi^{m} = (1 - \rho_m)\, \phi^{m-1} + \rho_m \Delta\phi^{m}, \quad (23)$$

$$\rho_m = (\tau_0 + m)^{-\kappa}, \quad (24)$$

where $\tau_0$ and $\kappa$ are parameters provided by users. Since
the learning rate satisfies $\sum_{m=1}^{\infty} \rho_m = \infty$ and
$\sum_{m=1}^{\infty} \rho_m^2 < \infty$, the
analysis shows that OVB can converge to a stationary point
of the objective function of VB. Indeed, OVB's learning rate
$\rho_m$ is similar to OBP's learning rate $1/(m-1)$ when
$\tau_0 = 0$ and $\kappa = 1$.
From this perspective, the major difference between OBP
and OVB is that the former is derived from FBP and the
latter is derived from VB. Another difference is that OVB
finishes scanning the mini-batch when the convergence
of $\theta_s(k)$ is achieved in VB [13], while OBP uses the
training perplexity (16) as the convergence criterion. The
residual VB (RVB) algorithm [14], [15] is an important
improvement of OVB. By dynamically scheduling
the order of mini-batches based on residuals, RVB is
slightly faster and more accurate than OVB. OGS is
derived from the sparse GS (SGS) [10]. We find that
its topic distribution update equation for unseen
documents is almost the same as (19). So, OGS can also
converge to a stationary point of the LDA objective
function within the online stochastic optimization
framework [22]. Sampled online inference (SOI) [12] explicitly
combines SGS with the scalability of online stochastic
inference [28]. Experiments show that SOI is around
twice as fast as OVB. The major difference between
OGS/SOI and OBP is that OGS/SOI is built on SGS while
OBP uses FBP to compute online gradient descents.
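
The two learning-rate schedules discussed in this subsection can be compared numerically. The sketch below is only an illustration: finite partial sums cannot prove the conditions (20) and (21), they merely show the expected trend (a slowly growing sum of rates and a bounded sum of squared rates). The function names are hypothetical.

```python
import numpy as np

def obp_rate(m):
    """OBP learning rate 1 / (m - 1) from (19), defined for m >= 2."""
    return 1.0 / (m - 1)

def ovb_rate(m, tau0=0.0, kappa=1.0):
    """OVB learning rate (tau0 + m)^(-kappa) from (24)."""
    return (tau0 + m) ** (-kappa)

# Partial sums of the OBP schedule over mini-batches m = 2, ..., 100000
m = np.arange(2, 100001)
rho = 1.0 / (m - 1)
partial_sum = rho.sum()            # grows roughly like log(m), no finite limit
partial_sq_sum = (rho ** 2).sum()  # stays below pi^2 / 6

# With tau0 = 0 and kappa = 1, OVB's rate at m equals OBP's rate at m + 1
same_rate = np.isclose(ovb_rate(5), obp_rate(6))
```

With $\tau_0 = 0$ and $\kappa = 1$ the two schedules differ only by a shift of one mini-batch index, which is the sense in which they are described as similar above.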

4 Experiments 

The experiments are carried out on four publicly
available data sets: ENRON, WIKI (footnote 1), NYTIMES and
PUBMED, listed in Table 1, where $D$ is the total number of
documents and $W$ is the vocabulary size. We randomly
reserve a small proportion of documents as the "Test" set,
and use the remaining documents as the "Online"
training set. Due to memory limits for batch LDA
algorithms, we randomly select a subset of documents as
"Batch" from the "Online" training set. Since ENRON is
a relatively small data set, both "Batch" and "Online"
contain the same number of documents. All experiments
are run on a Sun Fire X4270 M2 server with two
6-core 3.46 GHz CPUs and 128 GB RAM. We use the
training perplexity (16) and the predictive perplexity (22)
as performance measures, which have been widely used
in previous works 0, (4), Q, [13], EH, HQ. Generally, 
the lower the predictive perplexity on the test set, the better
the generalization ability. In all experiments, we fix the
hyperparameters $\alpha = \beta = 0.01$.

1. http://en.wikipedia.org/wiki/Data_set
Fig. 5. Training perplexity as a function of the ratio parameter $\eta$ when $K \in \{100, 200, 300, 400, 500\}$. 
4.1 Fast Belief Propagation (FBP) 

We compare the convergence speed of FBP with that of
other state-of-the-art batch LDA algorithms, including
FGS [24], sBP [4] and VB [2]. FGS can be viewed as a
fast surrogate of the conventional GS algorithm [3]. For
a fair comparison, we implement all algorithms using
the MEX C++/MATLAB R2010a 64-bit platform, and the
implementations have been made publicly available [30]. We use
the training perplexity (16) as the convergence criterion. If the
training perplexity difference between two successive
training iterations is less than one, the algorithm is
terminated. Fixing the estimated topic distributions
$\phi$, we calculate the predictive perplexity (22) on the test
set to evaluate the topic modeling accuracy.

First, we examine the ratio parameter $\eta$ in FBP. Fig. 5
shows the training perplexity as a function of $\eta \in
\{0.1, 0.2, 0.3, 0.4, 0.5\}$ when $K \in \{100, 200, 300, 400, 500\}$.
There is no big difference between $\eta = 0.1$ and
$\eta = 0.5$, especially when $K = 500$. This phenomenon
shows that only a small subset of topics plays a role
when $K$ is very large. Such a sparse property has
also been used recently to speed up GS for training LDA [10],
[24]. Further study shows that when $\eta > 0.3$ FBP
achieves even a slightly lower perplexity than the sBP
algorithm. The major reason is that FBP converges faster
than sBP to the lower perplexity. Notice that the learning
time of FBP still scales linearly with $K$. When $K$ is
very large, such as $K > 2000$, $\eta\mathcal{K}_d$ may be a constant,
e.g., $\eta\mathcal{K}_d = 50$. In this case, the learning time of FBP
is independent of $K$. This hypothesis, $\eta\mathcal{K}_d = 50$, is
reasonable because a common word usually cannot be
allocated to more than 50 topics. Users may set different
$\eta$ for different speedup effects. To pursue the maximum
speedup while retaining a comparable accuracy, we choose
$\eta = 0.1$ in the rest of our experiments.

Fig. 6 shows the CPU time per iteration as a function
of $K$ for FBP, FGS, sBP and VB. The learning time of
all algorithms increases linearly with $K$. When $\eta = 0.1$,
FBP in theory requires only 1/10 of the learning time of sBP
per iteration. However, FBP needs to sort and update
residuals, so that on average it consumes around 1/5 of the
learning time of sBP per iteration. Similar to FBP, FGS
also benefits from the sparse message passing process.
But FGS also needs to update the upper bound of the
local normalization factor of messages [24], and is slower
by a factor of around 2-3 than FBP. Although SGS [10]
is around twice as fast as FGS, it is still slower than FBP
per iteration according to the same benchmark FGS. In
addition, FBP can be even faster by applying a smaller
$\eta$ when $K$ is very large. Finally, VB is the slowest due to the
complicated digamma function computation [4], [7]. For
a better illustration, we multiply VB's learning time per
iteration by 0.2, denoted by 0.2x. In conclusion, Fig. 6
confirms that FBP is at present one of the fastest batch
LDA algorithms when $K > 100$.

Fig. 8. Predictive perplexity as a function of the number of topics K.

Fig. 7 shows the convergence time of FBP, FGS, sBP
and VB. Because FGS is an MCMC technique, it consumes
more iterations for convergence. FGS's convergence time
is comparable with that of sBP. Although VB uses the
least number of iterations for convergence, it still has
the longest convergence time, largely because it has the
longest CPU time per iteration. For a better illustration,
we multiply VB's convergence time by 0.3, denoted by
0.3x. Consistent with Fig. 6, FBP is the fastest algorithm
to converge, and is around 3 times faster
than FGS and sBP. This fast convergence speed will be
beneficial to the OBP algorithm. However, even for the
small subset of the PUBMED training data set in Table 1,
FBP still requires around 0.8 hour to converge when
K = 500, which means that FBP would need approximately
80 hours to scan the entire PUBMED data set when
K = 500, even if enough memory were available.
Similarly, FGS, sBP and VB would take around 290, 385
and 1400 hours, respectively, to accomplish the same
task. Such a time- and memory-consuming batch
process motivates the fast OBP algorithm, which requires
a constant memory space.
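
The extrapolation above can be checked with a few lines
of arithmetic. The sketch below uses only the hours quoted
in the text; the assumption that scanning time grows
linearly with the number of documents is ours, and the
variable names are illustrative.

```python
# Hours quoted in the text for scanning the entire PUBMED data set (K = 500).
full_scan_hours = {"FBP": 80.0, "FGS": 290.0, "sBP": 385.0, "VB": 1400.0}

# FBP needs about 0.8 hour on the training subset, so the quoted 80 hours
# corresponds to a scale factor of about 100 (assuming linear scaling in
# the number of documents).
fbp_subset_hours = 0.8
scale_factor = full_scan_hours["FBP"] / fbp_subset_hours

# Estimated time of each batch algorithm relative to FBP on the full data set.
time_vs_fbp = {name: hours / full_scan_hours["FBP"]
               for name, hours in full_scan_hours.items()}

print(f"implied scale factor: {scale_factor:.0f}")
for name, ratio in time_vs_fbp.items():
    print(f"{name}: {ratio:.2f} times the FBP scanning time")
```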

Fig. 8 shows the predictive perplexity of FBP, FGS,
sBP and VB. We see that the predictive perplexity of
FBP overlaps that of sBP, which demonstrates that
passing only the messages of the 10% subset of topics can
achieve the same topic modeling accuracy as sBP. FGS
is worse than both FBP and sBP by around 5% to 15%,
partly because FGS uses the sampling technique without
keeping all uncertainties in the message [4]. VB performs
the worst and shows an overfitting phenomenon, where
the predictive perplexity increases with the number of
topics on the NYTIMES and PUBMED data sets. For visual
clarity, we multiply VB's perplexity by 0.8, 0.9, 0.8 and
0.4 on the four data sets, respectively. Because VB optimizes
an approximate variational distribution to the joint dis- 
tribution of LDA (JJ, it introduces biases in variational 
message passing. The experimental results show that
such biases cannot be ignored when the number of topics
K is large on the NYTIMES and PUBMED data sets.
However, the proper setting of hyperparameters can correct
the biases in VB [7]. In summary, FBP converges fastest
with the highest topic modeling accuracy among several
state-of-the-art batch LDA algorithms.
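
For readers less familiar with the accuracy measure, the
sketch below computes perplexity in its standard form, the
exponential of the negative mean log-likelihood per token.
It is a generic illustration: the function name and the toy
probabilities are assumptions, and the exact predictive
protocol used in these experiments is the one defined in
the paper, not this code.

```python
import math


def perplexity(token_probs):
    """Standard perplexity: exp of the negative mean log-likelihood per token.

    token_probs: model probability assigned to each held-out token
    (hypothetical values; a real evaluation would compute them from the
    estimated document-topic and topic-word distributions).
    """
    if not token_probs:
        raise ValueError("need at least one token")
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))


# A model that spreads probability uniformly over 1000 words has
# perplexity 1000; a sharper model scores lower.
uniform = perplexity([1.0 / 1000] * 50)
sharper = perplexity([0.5, 0.25])
print(uniform, sharper)
```

Lower values indicate that the model assigns higher
probability to the held-out words.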

4.2 Online Belief Propagation (OBP) 

We compare OBP with two state-of-the-art online LDA
algorithms, OGS [10] and OVB [13]. All algorithms
are also publicly available [30]. For OVB, we use
its default parameters τ0 = 1024 and κ = 0.5 in (24) [13].
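
In the OVB formulation of [13], these two parameters define
a decaying step size ρ_t = (τ0 + t)^(-κ) for the t-th
mini-batch. The short sketch below evaluates this schedule
with the defaults quoted above; the function name and the
choice to count mini-batches from zero are assumptions made
for illustration.

```python
def ovb_step_size(t, tau0=1024.0, kappa=0.5):
    """Step size rho_t = (tau0 + t) ** (-kappa) for mini-batch index t >= 0."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return (tau0 + t) ** (-kappa)


# With tau0 = 1024 and kappa = 0.5 the first step size is 1/32,
# and later mini-batches receive progressively smaller weights.
first = ovb_step_size(0)
later = ovb_step_size(10_000)
print(first, later)
```

A larger τ0 down-weights the early mini-batches, and a
larger κ makes the influence of new mini-batches decay
faster.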

First, we study the mini-batch size S in the three online
LDA algorithms. For the three relatively smaller data
sets ENRON, WIKI and NYTIMES, we examine S =
{256, 512, 1024, 2048, 4096}. For the relatively larger data
set PUBMED, we test S = {1024, 2048, 4096, 8192, 16384}.
Fig. 9 shows the training time as a function of mini-batch
size S in log-scale when K = 100. OVB is faster
than OGS partly because OVB uses the convergence of
the variational parameter θ as the termination condition
for each mini-batch. Although this convergence criterion
makes OVB faster, it also leads to the worse predictive
ability of OVB in Fig. 10. We see that the learning time of
OGS and OBP increases slightly with the mini-batch size
S. The reason is that when S → D, OGS and OBP reduce
to SGS and FBP, which take longer to converge. In contrast,
OVB's training time decreases on the ENRON, WIKI and
NYTIMES data sets when S increases. The major reason
is that OVB converges with almost the same number of
iterations for smaller S. When S increases, OVB will scan
fewer mini-batches. When S becomes larger on the PUBMED
data set, OVB's training time first decreases and then
increases as S increases.
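
To make the relation between S and the number of scanned
mini-batches concrete, the sketch below cuts a document
stream into mini-batches while holding only one mini-batch
in memory at a time. The generator name and the stream of
10,000 placeholder documents are assumptions made for
illustration and do not correspond to any of the four data
sets.

```python
from itertools import islice


def mini_batches(documents, batch_size):
    """Yield successive lists of at most batch_size documents from a stream."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical stream of D = 10,000 documents (integers stand in for documents).
stream = range(10_000)
counts = {S: sum(1 for _ in mini_batches(stream, S))
          for S in (256, 512, 1024, 2048, 4096)}
print(counts)
```

Doubling S roughly halves the number of mini-batches, so an
algorithm whose cost per mini-batch grows slowly with S
spends less total time as S increases.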

Fig. 10 shows the predictive perplexity as a function
of mini-batch size S when K = 100. Both OGS and OBP
lower the perplexity value when S increases, because
larger mini-batch sizes lead to more robust online
gradient descent and thus higher topic modeling accuracy.
However, OVB performs worse when the mini-batch
size increases. According to the analysis in OVB [13],
this phenomenon may be attributed to the approximate
objective function of VB. Online gradient descents based
on small mini-batches may correct biases of the global
gradient descent of VB. Following OVB [13], to balance
speed and topic modeling accuracy, we choose S = 1024
in the rest of the experiments.

Fig. 11 shows the training time as a function of the
number of topics K. OBP, OGS and OVB use around 14,
58 and 37 hours, respectively, to scan the entire PUBMED
data set when K = 500. The online learning time is
significantly shorter than that of the corresponding batch
algorithms. For example, OBP uses approximately 15% of
the training time needed for FBP to scan the entire
PUBMED data set. Among the three online LDA algorithms,
OGS is the slowest because of the slow convergence of its
MCMC sampling. OBP is faster than OVB partly because it is
derived from the fast convergent FBP, as confirmed in Fig. 7.
Another reason is that OBP uses the sparse message 
passing method for a large number of topics. 

Fig. 12 shows the training time ratio over OBP. We see
that OBP is around twice as fast as OVB, and is 4 to 10
times faster than OGS, especially on the relatively larger
data sets NYTIMES and PUBMED. The recent RVB [11],
[15] and SOI [12] are two important improvements over
OVB and OGS, respectively. Although RVB [11], [15]
converges faster to a lower predictive perplexity than OVB,
it still requires slightly more training time than OVB to
scan all mini-batches because of additional scheduling
costs. SOI [12] speeds up OVB by a factor of 2, but it
is still comparable to or even slower than OBP according
to the same OVB benchmark. Therefore, OBP is very
competitive in speed for scanning big data streams.
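
As a quick consistency check, the training time ratio over
OBP can be recomputed from the approximate hours quoted
earlier for PUBMED at K = 500. The numbers are those stated
in the text; the variable names are illustrative, and the
ratios are rough because the quoted hours are rounded.

```python
# Approximate hours quoted in the text for one scan of PUBMED at K = 500.
online_hours = {"OBP": 14.0, "OGS": 58.0, "OVB": 37.0}
fbp_batch_hours = 80.0

# Training time of each online algorithm divided by that of OBP.
ratio_over_obp = {name: hours / online_hours["OBP"]
                  for name, hours in online_hours.items()}

# Share of the batch FBP scanning time that OBP needs.
obp_fraction_of_fbp = online_hours["OBP"] / fbp_batch_hours

for name, ratio in ratio_over_obp.items():
    print(f"{name}: {ratio:.2f} times the OBP training time")
print(f"OBP / FBP: {obp_fraction_of_fbp:.3f}")
```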

Fig. 13 shows the predictive perplexity as a function
of the number of topics K. Consistent with Fig. 8, OBP
has the highest topic modeling accuracy. For example,
OBP is 15% to 40% better than OGS across different
numbers of topics. OVB still performs the worst and shows
the overfitting phenomenon when K increases on the WIKI
and PUBMED data sets. For visual clarity, we multiply
OVB's perplexity by 0.6, denoted by 0.6x. Even though OVB
is faster than OGS, its accuracy is far from that of OGS,
due largely to the approximate objective function of VB
that leads to the inaccurate OVB algorithm. As an extension
of OVB, RVB [11], [15] also suffers from the same problem.
Because SOI is derived from SGS [10], its accuracy is
comparable with that of OGS, which yields higher predictive
perplexity than OBP. Together with Fig. 10, we may
reasonably conclude that OBP is one of the fastest online
LDA algorithms and achieves high topic modeling accuracy
for big data streams.

Fig. 14 shows the predictive perplexity as a function of
seen documents (log-scale) when K = 100. We see that
OBP, OGS and OVB can converge to a stationary point
by scanning more mini-batches. We also show the
predictive perplexity of FBP, FGS and VB as a comparison.
Notice that the predictive perplexity of VB and OVB is
multiplied by 0.6, denoted by 0.6x. For the ENRON data
set, the "Batch" and "Online" training sets are the same in
Table 1. We see that online algorithms cannot converge
to the same perplexity as batch algorithms using the
same training set. One reason is that the batch gradient
descent is more accurate than the online gradient descent
given the same training data. When more training
data are used, as on the WIKI, NYTIMES and PUBMED
data sets, OBP can converge to almost the same or even
lower predictive perplexity than FBP. Similarly, OVB
also converges to a lower perplexity than VB on the
PUBMED data set. However, OGS always has a gap
with FGS in terms of the predictive perplexity, partly
because OGS uses sampled word counts to compute the
online gradient descent, which causes more stochastic
noise [28]. Fig. 14 shows that OBP in practice can
converge to the stationary point of the LDA objective
function by a series of online gradient descents [19]. This
result is consistent with our convergence analysis in
Section 3. Obviously, OBP converges the fastest
among the three online LDA algorithms.

Fig. 15. Topic shifts of OBP as seen documents increase on the WIKI data set.

Fig. 15 illustrates an example of topic shifts detected
by OBP on the WIKI data set. We set the number of
topics K = 10, and show the top ten words from three
topics in Fig. 15. Topic 1 is about "system and software".
When D = 1024, the word "program" is not in the top
ten words. Gradually, when D = 3072 and 5120, the
word "program" is ranked higher in this topic. Similarly, 
the word "computer" becomes important in this topic 
as seen documents increase. The word "network" first 
appears in this topic when D = 19456. On the other 
hand, the word "file" becomes unimportant as more
documents are scanned. Topic 2 is a "water"-related topic.
When D = 1024 and 3072, the words "lion" and "animal"
are included in this topic. With more documents seen,
"water" becomes closely related to "energy", "power" and
"electricity", which implies that more and more documents
focus on water's energy property. Topic 3 is on "music".
We see that the word "band" is ranked lower and lower
as more documents are seen. The two words "record"
and "album" become more and more important in this
topic. The ranking of most top words becomes stable
when D = 19456, which implies that the topic distribution
converges to a stationary point. More generally, if
we organize the data stream in chronological order,
OBP can detect the topic evolution as the data stream flows.
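
The kind of comparison described above can be sketched in
a few lines: take the top words of a topic at two points in
the stream and report how each word's rank has changed. The
helper names and the toy weights below are hypothetical and
only borrow a few words from Topic 1 for readability; they
are not values from the experiment.

```python
def top_words(word_weights, n=10):
    """Return the n highest-weighted words of one topic, best first."""
    ranked = sorted(word_weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


def rank_changes(previous_top, current_top):
    """Map each current top word to its rank change; None marks a new entry."""
    previous_rank = {word: rank for rank, word in enumerate(previous_top)}
    changes = {}
    for rank, word in enumerate(current_top):
        if word in previous_rank:
            # Positive values mean the word moved up in the ranking.
            changes[word] = previous_rank[word] - rank
        else:
            changes[word] = None
    return changes


# Hypothetical topic-word weights at two checkpoints of the stream.
checkpoint_a = {"system": 0.30, "software": 0.25, "file": 0.20, "program": 0.05}
checkpoint_b = {"system": 0.28, "software": 0.22, "program": 0.21, "file": 0.08}

top_a = top_words(checkpoint_a, n=3)
top_b = top_words(checkpoint_b, n=3)
shift = rank_changes(top_a, top_b)
print(top_a, top_b, shift)
```

A word that enters the list, such as "program" in this toy
example, signals a possible topic shift, while a list whose
ranks stop changing suggests that the topic has stabilized.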



5 Conclusions 

This paper presents a novel OBP algorithm for learning 
LDA, which combines the fast convergent batch FBP 
algorithm with the online stochastic optimization
framework. OBP not only processes big text streams
efficiently in both time and memory, but also detects
dynamic topic shifts as the data streams flow. OBP can
converge to the stationary point of the LDA objective
function. Extensive experiments confirm that OBP is
superior to the state-of-the-art OGS [10] and OVB [13]
algorithms in terms
of both speed and accuracy. To pursue further speedup 
effects, we may extend OBP on the parallel architec- 
tures JUL (19) . I3l11 . With the communication-efficient 
parallel topic modeling techniques, we can analyze mul- 
tiple data streams simultaneously, and find topic shifts 
and interactions among these data streams. 